Load the dataset from kaggle api utlizing kaggle secrets file stored in .kaggle folder i.e. Json file.ΒΆ
InΒ [828]:
import kaggle
import os
import json
# Load Kaggle API credentials from the kaggle.json file
kaggle_json_path = r'C:\Users\krist\OneDrive\Desktop\Kaggle Practice Data Set\kaggle.json'
with open(kaggle_json_path, 'r') as f:
kaggle_credentials = json.load(f)
os.environ['KAGGLE_USERNAME'] = kaggle_credentials['username']
os.environ['KAGGLE_KEY'] = kaggle_credentials['key']
# Initialize Kaggle API
api = kaggle.KaggleApi()
api.authenticate()
# Example: Download a dataset
dataset = 'atharvasoundankar/chocolate-sales' # Replace with the dataset you want to download
path = r'C:\Users\krist\OneDrive\Desktop\Kaggle Practice Data Set' # Replace with your desired download path
# Download the dataset
api.dataset_download_files(dataset, path=path, unzip=True)
print(f"Dataset '{dataset}' downloaded to '{path}'")
Dataset URL: https://www.kaggle.com/datasets/atharvasoundankar/chocolate-sales Dataset 'atharvasoundankar/chocolate-sales' downloaded to 'C:\Users\krist\OneDrive\Desktop\Kaggle Practice Data Set'
Look at the data via pandas daframes and import all need libraries for Explortory Data Analysis. Then Start Data Cleaning before we start to do Exploratory Data Analysis.ΒΆ
InΒ [829]:
%%HTML
<script src="require.js"></script>
InΒ [830]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
import chart_studio.plotly as py
from IPython.display import HTML
import seaborn as sns
import datetime as dt
import numpy as np
import altair as alt
import sklearn as sk
pio.renderers.default='notebook'
# Load the data
dfChocolateSales = pd.read_csv('Chocolate Sales.csv')
# Display all columns
pd.set_option('display.max_columns', None)
#Display the data (top 5 rows of data frame)
dfChocolateSales.head()
Out[830]:
| Sales Person | Country | Product | Date | Amount | Boxes Shipped | |
|---|---|---|---|---|---|---|
| 0 | Jehu Rudeforth | UK | Mint Chip Choco | 04-Jan-22 | $5,320 | 180 |
| 1 | Van Tuxwell | India | 85% Dark Bars | 01-Aug-22 | $7,896 | 94 |
| 2 | Gigi Bohling | India | Peanut Butter Cubes | 07-Jul-22 | $4,501 | 91 |
| 3 | Jan Morforth | Australia | Peanut Butter Cubes | 27-Apr-22 | $12,726 | 342 |
| 4 | Jehu Rudeforth | UK | Peanut Butter Cubes | 24-Feb-22 | $13,685 | 184 |
In the table above we can see that Amount contains a $ sign and comma which mean it is a string and will need to be converted to a int so we can use it in our model.ΒΆ
InΒ [831]:
# Run to lookinfo at all data types within the dataset
dfChocolateSales.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1094 entries, 0 to 1093 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Sales Person 1094 non-null object 1 Country 1094 non-null object 2 Product 1094 non-null object 3 Date 1094 non-null object 4 Amount 1094 non-null object 5 Boxes Shipped 1094 non-null int64 dtypes: int64(1), object(5) memory usage: 51.4+ KB
In the above output we see that there are 5 objects and one integer. However, Date needs to formated as a datetime, and amount as a integer.ΒΆ
InΒ [832]:
# Clean the Data
# Amount column is formated as a string and needs to be converted to an integer
## Remove the $ and , characters and convert the column to an integer
dfChocolateSales['Amount'] = dfChocolateSales['Amount'].str.replace('$', '').str.replace(',', '').astype(int)
# Convert the Date column to a datetime object
dfChocolateSales['Date'] = pd.to_datetime(dfChocolateSales['Date'])
# Convert Dates To Months
dfChocolateSales['Date'] = dfChocolateSales['Date'].dt.to_period('M').astype(str)
C:\Users\krist\AppData\Local\Temp\ipykernel_33408\14258022.py:8: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
Create Python Class with functions that look at unique total, sum, average.ΒΆ
InΒ [833]:
# Create Class of Column Value Counts that contains functions for totals, mean, sum, etc...
class Column_value_counts:
# Get the total unique values in a column
def Unique_Totals(self, column):
total_unique = pd.Series(column.unique()).count()
return total_unique
# Get Sums of Numeric Columns
def Sum_Numeric_Columns(self, column):
sum_column = column.sum()
return sum_column
# Get Group By Averaged columns
def Group_by_AVG(self, df, group_by_column, avg_column):
groupbyavg = df.groupby(group_by_column)[avg_column].mean().astype(int)
return groupbyavg
# Create an instance of the class
column_value_counts = Column_value_counts()
# Print the results
print(f"There is a total of {column_value_counts.Unique_Totals(dfChocolateSales['Sales Person'])} sales people in the dataset representing {column_value_counts.Unique_Totals(dfChocolateSales['Country'])} countries.")
print(f"There are {column_value_counts.Unique_Totals(dfChocolateSales['Product'])} products in the dataset.")
print(f"The total boxes shipped is {column_value_counts.Sum_Numeric_Columns(dfChocolateSales['Boxes Shipped'])} and the average chocolate boxes sold by each sales person is:\n{column_value_counts.Group_by_AVG(dfChocolateSales, 'Sales Person', 'Boxes Shipped')}")
There is a total of 25 sales people in the dataset representing 6 countries. There are 22 products in the dataset. The total boxes shipped is 177007 and the average chocolate boxes sold by each sales person is: Sales Person Andria Kimpton 165 Barr Faughny 148 Beverie Moffet 184 Brien Boise 152 Camilla Castle 167 Ches Bonnell 156 Curtice Advani 153 Dennison Crosswaite 178 Dotty Strutley 190 Gigi Bohling 134 Gunar Cockshoot 155 Husein Augar 153 Jan Morforth 196 Jehu Rudeforth 168 Kaine Padly 161 Karlen McCaffrey 205 Kelci Walkden 161 Madelene Upcott 161 Mallorie Waber 145 Marney O'Breen 178 Oby Sorrel 175 Rafaelita Blaksland 126 Roddy Speechley 160 Van Tuxwell 133 Wilone O'Kielt 118 Name: Boxes Shipped, dtype: int64
Create a bar chart showing the total boxes sold for each choclate product.ΒΆ
InΒ [834]:
total_boxes_by_product = dfChocolateSales.groupby('Product')['Boxes Shipped'].sum().reset_index()
# Visualze the data in a bar chart using Altair
total_boxes_by_product_bar_chart = px.bar(
total_boxes_by_product,
x='Product',
y='Boxes Shipped',
color='Product',
title='Total Boxes Shipped by Product')
total_boxes_by_product_bar_chart.show()
Create Circle Chart showing Total Sales Amount by CountryΒΆ
InΒ [839]:
# Create ne df for chart
total_product_sales_amount_by_country = dfChocolateSales.groupby('Country')['Amount'].sum().reset_index()
# Create Choropleth Map
Total_Product_Sales_By_Country_Map = px.choropleth(
total_product_sales_amount_by_country,
locations='Country',
locationmode='country names',
color='Amount',
hover_name='Country',
color_continuous_scale=px.colors.sequential.Plasma,
title='Total Product Sales by Country'
)
Total_Product_Sales_By_Country_Map.show()
Create line chart looking at total sales over time.ΒΆ
InΒ [836]:
# Create datframe for sales over time
dfsalesovertime = dfChocolateSales.groupby('Date')['Amount'].sum().reset_index()
#
Sales_over_time_line_chart = px.line(
dfsalesovertime,
x='Date',
y='Amount',
title='Sales Over Time',
markers=True)
# Save the chart as an HTML file
Sales_over_time_line_chart